ABSTRACT

With Machine Translation and/or Automatic Speech Recognition, there
can be different versions of the same data with distinct expressions.
We argue that combining evidence from these “homologous” datasets can
give us a better representation of the original data, and our
experiments show that a model combining all sources outperforms each
individual dataset in retrieval.
1. INTRODUCTION

Many information retrieval collections today, such as those in the
Text REtrieval Conference (TREC), are not limited to a single
language, and data in other languages are machine translated (MT)
into English. For projects focused on news, like Topic Detection and
Tracking (TDT), data may also come from multiple media (newswire,
web, radio, TV, etc.), and broadcast news often comes with
transcripts from Automatic Speech Recognition (ASR). The variety of
sources and MT/ASR systems results in different versions of the same
dataset, and their quality varies. Extensive experiments are usually
required to pick the “best” version that achieves the highest
accuracy in information retrieval.

Without enough relevance judgments, it is difficult or sometimes
impossible to decide which version is the “best”. Fortunately, there
is an alternative: if the quality of different versions cannot be
compared directly, why not use all of them? It is safe to assume that
all MT/ASR systems make mistakes, and the errors are usually not all
the same. Therefore, a combination model that processes all versions
concurrently and merges their output makes it more likely that the
versions compensate for one another's errors.

Figure 1: The combination model

It is widely accepted in the Information Retrieval (IR) community
that combining multiple sources of evidence can improve retrieval
performance [3]. However, most experiments focus on different
representations of the information need [1] or multiple document
models [6], and we have not seen any attempt to utilize parallel
data. Metasearch [5] merges the results of multiple algorithms, while
our model combines the evidence from different content.

Cross-Language Information Retrieval (CLIR) finds documents in one
language with queries in other languages. Commonly used methods in
CLIR include MT, parallel corpora, query expansion, pseudo-relevance
feedback, etc. [2]. The combination model is not restricted to one
language, and we use both automatic and manual query translation in
the experiments.

2. COMBINATION MODEL

Figure 1 shows the structure of the combination model. It is composed
of two main steps: the indexing step and the query step.

Figure 2: A “document” that contains four different versions: one in
Mandarin and others from three MT systems

The upper half of Figure 1 is the indexing step. In this figure,
there are n parallel datasets. They can be in different languages or
even different media, but here we only deal with text (because we
lack tools for processing multimedia data). Each corpus is then
broken down into small segments. For text, these units are usually
documents/news stories, but they can also be passages. Segments from
different datasets must be correctly aligned so that they have
corresponding content. In the next step, aligned segments from the
various sources are merged into a “document”, in which each version
is marked by its own tag. Figure 2 shows a sample “document”, in
which tag MAN is for the source Mandarin data, while MTa, MTb and MTc
are for three different MT systems.
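The merging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TREC-style <DOC>/<DOCNO> wrapper, the function names, the story identifiers and the sample sentences are all hypothetical, and alignment is reduced to matching identical story identifiers across versions.

```python
from xml.sax.saxutils import escape


def merge_versions(doc_id, versions):
    """Merge aligned versions of one story into a single tagged "document".

    `versions` maps a field tag (e.g. "MAN", "MTa") to that version's text.
    The <DOC>/<DOCNO> wrapper is an assumption made for illustration.
    """
    lines = ["<DOC>", f"<DOCNO>{escape(doc_id)}</DOCNO>"]
    for tag, text in versions.items():
        lines.append(f"<{tag}>{escape(text.strip())}</{tag}>")
    lines.append("</DOC>")
    return "\n".join(lines)


def merge_corpora(corpora):
    """corpora maps tag -> {doc_id -> text}.

    Only stories present in every version are kept, so each merged
    "document" holds corresponding content in all of its fields.
    """
    shared_ids = set.intersection(*(set(c) for c in corpora.values()))
    return {
        doc_id: merge_versions(
            doc_id, {tag: corpus[doc_id] for tag, corpus in corpora.items()}
        )
        for doc_id in sorted(shared_ids)
    }


# Hypothetical toy data; the texts are placeholders, not TDT-5 content.
corpora = {
    "MAN": {"story-001": "(Mandarin source text)", "story-002": "(Mandarin source text)"},
    "MTa": {"story-001": "Williams won the match", "story-002": "Markets rose"},
    "MTb": {"story-001": "weilianmusi won the match"},
}
merged = merge_corpora(corpora)
print(merged["story-001"])
```

In this toy input, story-002 is dropped because MTb has no aligned version of it, which mirrors the requirement that segments be aligned before merging.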

All the merged data are then indexed by Indri, with “documents” as
the basic units. Indri is an information retrieval system based on
language modeling and inference networks [4]. It has the ability to
index individual fields. The n datasets have their own tags in the
index (like MAN, MTa, MTb and MTc in Figure 2), so each version is
separately accessible by Indri.

When the query is in one language and the document is in another, we
cannot match them directly because their vocabularies do not overlap.
In that case, the query must be translated into the source language.
Sometimes even corpora in the same language use different words (note
Williams vs. weilianmusi in Figure 2), which is hard to deal with
unless we have the translation tables for all MT systems. Based on
the confidence in each version, different weights can be associated
with the individual fields. Here is the final query submitted to the
Indri index:

    #weight(w_1 q_1.tag_1  w_2 q_2.tag_2  ...  w_n q_n.tag_n)    (1)

#weight is a belief operator in Indri. It has an even number of
arguments, where the odd-indexed ones (w_i) are weights assigned to
the query in the next argument. q_i.tag_i means that query q_i (the
original user query or some translation of it) is used to match the
text in the i-th field, identified by tag_i. The similarity score of
a query q and a document D is calculated by:

    sim(q, D) = \sum_{i=1}^{n} w_i \, sim(q_i, D_{tag_i})    (2)

where sim(q_i, D_{tag_i}) is the similarity score between q_i and the
text in the tag_i field of D (see footnote 1). The retrieval result
returned by query q is a ranked list of documents sorted by the
similarity in Equation 2.

System               MAP       P-value
MTa (E)              0.4876
MTb (E)              0.4443
MTc (E)              0.3308
Man1 (C)             0.2704
Man2 (C)             0.4209
Combine1 (E)         0.4752    0.6990
Combine2 (E+C)       0.5042    0.0978
Combine3 (E+C)       0.5004    0.1076
Combine1opt (E)      0.5014    0.0179*
Combine2opt (E+C)    0.5272    0.0067*
Combine3opt (E+C)    0.5159    0.0445*

Table 1: Ad-hoc retrieval results of individual and combined runs
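As a rough illustration of Equations 1 and 2, the Python sketch below assembles a #weight query string and ranks documents by the weighted sum of per-field scores. Several things here are assumptions rather than the authors' code: expanding q_i.tag_i into #combine(term.tag ...) is one plausible reading of Indri's field-restriction syntax, the function names are hypothetical, and the per-field scores are toy numbers rather than real Indri output.

```python
def build_weight_query(field_queries):
    """Build an Equation 1 style query from (weight, terms, tag) triples.

    Assumption: q_i.tag_i is written as #combine(term.tag ...), i.e. every
    query term is restricted to the field named by the tag.
    """
    parts = []
    for weight, terms, tag in field_queries:
        restricted = " ".join(f"{term}.{tag}" for term in terms)
        parts.append(f"{weight} #combine({restricted})")
    return "#weight(" + " ".join(parts) + ")"


def combined_score(weights, field_scores):
    """Equation 2: weighted sum of the per-field similarity scores."""
    return sum(w * field_scores[tag] for tag, w in weights.items())


def rank_documents(weights, scores_by_doc):
    """Return document ids sorted by decreasing combined score."""
    return sorted(
        scores_by_doc,
        key=lambda doc: combined_score(weights, scores_by_doc[doc]),
        reverse=True,
    )


# Field weights taken from the Combine1opt run; query terms are illustrative.
weights = {"MTa": 1.25, "MTb": 1.0, "MTc": 0.15}
terms = ["ahmed", "qureia"]
query = build_weight_query([(w, terms, tag) for tag, w in weights.items()])
print(query)

# Toy per-field scores sim(q_i, D_tag_i) for two hypothetical documents.
toy_scores = {
    "d1": {"MTa": 0.2, "MTb": 0.1, "MTc": 0.4},
    "d2": {"MTa": 0.3, "MTb": 0.05, "MTc": 0.1},
}
ranking = rank_documents(weights, toy_scores)
print(ranking)
```

With these toy numbers d2 ranks above d1 even though d1 scores higher in the MTc field, because MTc carries the smallest weight.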

3. EXPERIMENTS

The datasets in the combination model may come from different media,
but we processed only text in our experiments because we do not have
the necessary tools or models to convert or retrieve multimedia data.
The text collection contains only newswire, since different ASR
systems for broadcast news usually produce different document
boundaries, which makes alignment between versions quite difficult.

We used the TDT-5 collection in our experiments. It is a news corpus
collected from English, Mandarin and Arabic sources between April and
September 2003. This collection does not contain broadcast news data,
only newswire sources. We used the Mandarin data from July, August
and September, making up 602 files containing a total of 27,723 news
stories (documents).

We obtained three machine translations of the 27,723 stories from
three different systems. (None of them is the one included in the
TDT-5 corpus.) The translations represent current state-of-the-art
technologies, but because they are preliminary work we do not
identify the systems here and instead call them MTa, MTb and MTc.
MTa and MTb are both based on statistical models, while MTc is
rule-based.

From the 250 topics in TDT-5, we selected 35 that have at least 5
on-topic stories in our collection. The topic titles (in English as
provided by the LDC) are used as queries.

All experiments run on the combined index and differ only in their
weight settings, which are described in the list below. The results
are shown in Table 1.

1 Details of the retrieval model can be found here:
http://ciir.cs.umass.edu/~metzler/indriretmodel.html.


TDT-5 topic: 55181
Topic title: Palestine: Ahmed Qureia tapped as next prime minister
Google translation: [Chinese text not recoverable; the name
“Ahmed Qureia” is left untranslated]
Manual translation: [Chinese text not recoverable]

Figure 3: Comparison of query translation: automatic vs. manual

• MTa, MTb and MTc: Only one of the three MT versions gets weight 1;
all other weights (including Mandarin) are set to 0. Topic titles are
used without any translation.

• Man1: We translated the English topic titles into simplified
Chinese with Google translation
(http://www.google.com/language_tools), and then manually segmented
the terms. We use these queries to match the original Mandarin data
(field MAN in Figure 2) and set its weight to 1. Each English source
gets a weight of zero.

• Man2: The translation quality in Man1 is not very good. In fact,
some terms (especially names) were not translated at all (see
Figure 3). To get Mandarin queries of better quality, we manually
translated the topic titles and replaced the queries in Man1 with the
output. Retrieval accuracy improves markedly with the manually
translated queries (MAP 0.4209 vs. 0.2704 in Table 1).

• Combine1: All English versions (MTa, MTb and MTc) are combined with
equal weights.

• Combine2: We combine all English versions and the original
Mandarin. The topic title is used as the English query, and the
output from Google translation is the Mandarin query. Each of the
four fields gets the same weight.

• Combine3: Same as Combine2 except that the Mandarin queries are
manually translated.

• Combine1opt, Combine2opt and Combine3opt: The weights in the
combined versions are tuned to maximize the Mean Average Precision
(MAP). In Combine1opt, (MTa, MTb, MTc) = (1.25, 1.0, 0.15). In
Combine2opt, (MTa, MTb, MTc, Man1) = (1.25, 1.0, 0.15, 0.7). In
Combine3opt, (MTa, MTb, MTc, Man2) = (1.25, 1.0, 0.15, 0.85).
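The text does not say how the weights of the optimized runs were tuned. One simple possibility, shown here purely as an assumption, is an exhaustive grid search that keeps the weight setting with the highest MAP. The function names, the candidate grid, the two-field setup and all scores and relevance judgments below are hypothetical toy values.

```python
from itertools import product


def average_precision(ranking, relevant):
    """Average precision of one ranked list given the set of relevant docs."""
    hits, total = 0, 0.0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(weights, queries):
    """queries is a list of (scores_by_doc, relevant_set) pairs, where
    scores_by_doc maps doc -> {tag: per-field score}."""
    aps = []
    for scores_by_doc, relevant in queries:
        ranking = sorted(
            scores_by_doc,
            key=lambda d: sum(w * scores_by_doc[d][t] for t, w in weights.items()),
            reverse=True,
        )
        aps.append(average_precision(ranking, relevant))
    return sum(aps) / len(aps)


def tune_weights(tags, grid, queries):
    """Try every weight combination on the grid; keep the first best MAP."""
    best_map, best_weights = -1.0, None
    for combo in product(grid, repeat=len(tags)):
        if not any(combo):
            continue  # all-zero weights give no ranking signal
        weights = dict(zip(tags, combo))
        score = mean_average_precision(weights, queries)
        if score > best_map:
            best_map, best_weights = score, weights
    return best_map, best_weights


# Toy data: two queries, two fields, one relevant document per query.
queries = [
    ({"a": {"MTa": 0.9, "MTc": 0.2}, "b": {"MTa": 0.2, "MTc": 0.8}}, {"a"}),
    ({"c": {"MTa": 0.6, "MTc": 0.2}, "d": {"MTa": 0.5, "MTc": 0.9}}, {"c"}),
]
best_map, best_weights = tune_weights(["MTa", "MTc"], [0.0, 0.5, 1.0], queries)
print(best_map, best_weights)
```

Tuning and evaluating on the same queries, as this sketch does, can overstate the gain; with more judged topics a held-out split would be a safer design.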

The first data column in Table 1 shows the MAP of all 35 queries. The
corresponding number in the second column (only for the combined
runs) is the maximum of the P-values from the Wilcoxon rank-sum tests
between the combined run and each composing version. For example,

    Pvalue(Combine1) = max(Pvalue(Combine1, MTa),
                           Pvalue(Combine1, MTb),
                           Pvalue(Combine1, MTc))    (3)


A number with * means that the combined run is a significant
improvement over all individual versions at the 95% confidence level.
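The P-value in Equation 3 can be sketched as follows. The rank-sum test here uses the normal approximation without tie or continuity correction, which is an assumption, since the text does not say which variant was used. The function names and the per-query average precision values are hypothetical toy data.

```python
import math


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum P-value via the normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their mean rank.
        mean_rank = (i + 1 + j) / 2
        rank_sum_x += mean_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / std
    return math.erfc(abs(z) / math.sqrt(2))


def max_p_value(combined_ap, component_aps):
    """Equation 3: the largest P-value over all composing versions."""
    return max(rank_sum_p_value(combined_ap, ap) for ap in component_aps.values())


# Toy per-query average precision values (not the paper's data).
combined_ap = [0.90, 0.80, 0.85, 0.95, 0.70]
component_aps = {
    "MTa": [0.50, 0.40, 0.45, 0.60, 0.30],
    "MTb": [0.65, 0.20, 0.10, 0.55, 0.35],
}
p = max_p_value(combined_ap, component_aps)
print(p, p < 0.05)
```

Taking the maximum means a combined run is starred only if it differs significantly from every composing version, not just from the weakest one.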

Without any parameter tuning, the combined model is at least
comparable to the best individual version. The improvement is more
obvious when we mix sources with different vocabularies. Note that
Man1 has the lowest MAP (because of the translation quality), but
incorporating it into the combination model still improves
performance noticeably (MAP rises from 0.4752 for Combine1 to 0.5042
for Combine2). With proper weight settings, the combination model can
be significantly better than any individual version.

Another interesting observation is the comparison between Combine2
and Combine3, for both the non-optimized and the optimized
combination. Although the accuracy of Man1 is much lower than that of
Man2, we do not see any improvement in the combined run when we
replace the Google translated queries with the manual ones. The
reason is still unclear to us, but we can draw one conclusion from
the experiments: the performance of the combination model relies more
on the heterogeneity than on the quality of individual sources.

4. CONCLUSIONS

Datasets often come in different versions. Instead of selecting the
best of them with great difficulty, a combination model can be
designed to utilize all available evidence concurrently. Our
experiments show that this model yields better results than any of
the individual versions, especially when the sources are in multiple
languages. Because of the limitations of available tools, we did not
run any cross-medium experiments. However, a combination model of
different media (text, audio, video, etc.) is exciting to
contemplate.